CRUXEval-input

pairwise wins

p-values

result table

    model                       pass1  win_rate       elo
 0  gpt-4-0613+cot              0.755     0.947  1540.237
 1  gpt-4-turbo-2024-04-09+cot  0.757     0.865  1380.317
 2  gpt-3.5-turbo-0613+cot      0.503     0.765  1259.673
 3  gpt-4-0613                  0.698     0.717  1203.368
 4  claude-3-opus-20240229+cot  0.734     0.685  1174.246
 5  gpt-4-turbo-2024-04-09      0.685     0.683  1174.064
 6  codellama-34b+cot           0.501     0.675  1173.731
 7  codellama-13b+cot           0.474     0.604  1115.937
 8  claude-3-opus-20240229      0.642     0.554  1068.591
 9  codellama-7b+cot            0.404     0.542  1064.422
10  codetulu-2-34b              0.492     0.525  1049.250
11  codellama-34b               0.472     0.509  1036.052
12  deepseek-base-33b           0.465     0.489  1022.381
13  deepseek-instruct-33b       0.465     0.465  1002.395
14  gpt-3.5-turbo-0613          0.490     0.461  1000.000
15  codellama-python-34b        0.439     0.457   998.455
16  phind                       0.472     0.450   993.536
17  codellama-13b               0.425     0.442   989.942
18  deepseek-base-6.7b          0.419     0.438   985.885
19  mixtral-8x7b                0.393     0.411   965.541
20  codellama-python-13b        0.397     0.411   965.146
21  magicoder-ds-7b             0.417     0.379   939.551
22  wizard-34b                  0.427     0.376   937.456
23  codellama-python-7b         0.373     0.341   912.804
24  codellama-7b                0.360     0.327   901.755
25  mistral-7b                  0.350     0.317   894.167
26  deepseek-instruct-6.7b      0.374     0.297   877.088
27  wizard-13b                  0.365     0.295   874.113
28  phi-2                       0.316     0.293   873.544
29  starcoderbase-16b           0.313     0.280   863.138
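
The elo column may be easier to read once rating gaps are turned into expected head-to-head win probabilities. The following is a minimal sketch, assuming the ratings follow the conventional Elo logistic curve (base 10, scale 400); the table does not state which scale was used, and `expected_score` is a hypothetical helper name, not something taken from the benchmark's code.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected win probability of A over B under the conventional Elo curve.

    Assumption: base-10 logistic with a 400-point scale; the source table
    does not confirm that its elo values were fitted this way.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings copied from the elo column of the result table.
elo = {
    "gpt-4-0613+cot": 1540.237,
    "gpt-4-0613": 1203.368,
    "gpt-3.5-turbo-0613": 1000.000,
}

pairs = [
    ("gpt-4-0613+cot", "gpt-4-0613"),
    ("gpt-4-0613", "gpt-3.5-turbo-0613"),
]

for a, b in pairs:
    # Probability that model a beats model b in a single comparison.
    print(f"{a} vs {b}: {expected_score(elo[a], elo[b]):.3f}")
```

Under this assumption, equal ratings give 0.5 and a 400-point gap corresponds to 10:1 odds in favour of the higher-rated model. The win_rate column is a separate quantity and should not be expected to match these pairwise figures.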